algorithm uses an uncompressed suffix array (SA) data structure to perform a sequential
maximum mappable seed search. The developer defines a maximum mappable seed as the longest
substring of a read that exactly matches one or more substrings of the reference genome.
The search proceeds by mapping seeds to the reference genome one after another. A read that
spans a splice junction cannot be mapped contiguously, so the algorithm aligns the first
seed up to a donor splice site and then repeats the search on the unmapped portion of the
read, which aligns starting at an acceptor splice site. The search is performed in both the
forward and reverse directions of the read. This kind of search also helps in the detection
of base mismatches and InDels. If one or more mismatches are found, the matched substrings
act as anchors on the genome from which the alignment can be extended. The search is
followed by clustering of the seeds by proximity to determine the anchor seeds. Then, the
aligned seeds that lie within a user-defined window around the anchor seeds are stitched
together using dynamic programming. STAR is capable of detecting splice junctions and
chimeric transcripts and of mapping complete RNA transcripts that are formed from
non-contiguous exons in eukaryotes [8].
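The sequential seed search described above can be illustrated with a toy shell sketch. This is not STAR's implementation: a plain substring test stands in for the suffix array lookup, only the forward direction is searched, and the function names (longest_prefix_match, split_into_seeds) and the short "genome" are made up for illustration.

```shell
#!/bin/sh
# Toy sketch of a sequential maximum mappable seed search.
# Assumption: a substring test replaces STAR's suffix array lookup.

# longest_prefix_match REF READ
# Prints the longest prefix of READ that occurs exactly in REF (may be empty).
longest_prefix_match() {
    ref=$1
    seed=$2
    while [ -n "$seed" ]; do
        case $ref in
            *"$seed"*) break ;;   # exact match found somewhere in the reference
        esac
        seed=${seed%?}            # drop the last base and try a shorter prefix
    done
    printf '%s\n' "$seed"
}

# split_into_seeds REF READ
# Repeats the prefix search on the unmapped remainder of the read and
# prints one seed per line.
split_into_seeds() {
    ref=$1
    rest=$2
    while [ -n "$rest" ]; do
        seed=$(longest_prefix_match "$ref" "$rest")
        if [ -z "$seed" ]; then
            # First base does not occur in the reference: report and skip it.
            printf 'unmapped:%s\n' "$(printf '%s' "$rest" | cut -c1)"
            rest=${rest#?}
        else
            printf 'seed:%s\n' "$seed"
            rest=${rest#"$seed"}
        fi
    done
}

# Hypothetical genome: exon ACGTTGCA, intron GTAAGTTTTTCAG, exon CCATGGAT.
genome=ACGTTGCAGTAAGTTTTTCAGCCATGGAT
# A read that spans the exon-exon junction is split into two seeds.
split_into_seeds "$genome" GTTGCACCATGG
# prints:
# seed:GTTGCA
# seed:CCATGG
```

The first seed ends where the intron begins (the donor side), and the second seed starts where the next exon begins (the acceptor side), which is the behavior the paragraph above describes for spliced reads.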
The STAR software can be installed by following the installation instructions, which are
available at “https://github.com/alexdobin/STAR”. On Ubuntu, you can install STAR using
the following command:
sudo apt install rna-star
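After installation, it can be useful to confirm that the STAR executable is reachable from the shell. The following is a minimal sketch; the helper name have_tool is hypothetical, and the sketch assumes that the package installs the executable under the name STAR.

```shell
# have_tool NAME: succeed if an executable called NAME is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool STAR; then
    # Print the installed STAR version.
    STAR --version
else
    echo "STAR is not on the PATH; check the installation" >&2
fi
```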
As with most read aligners, the basic STAR workflow consists of index generation followed
by read alignment. Index generation needs a reference genome in FASTA format; supplying a
reference annotation file in GTF format as well is strongly recommended, because it lets
STAR include known splice junctions in the index. Pre-built indexes for genomes
of some species can be downloaded from the STAR official website. As discussed before,
the reference genomes can be downloaded from databases such as NCBI Assembly, UCSC
genome collection, or any other database. For the aligners discussed before, we
downloaded the human reference genome from the NCBI Genome database. For STAR, we will
download the human reference genome and its GTF annotation file from the UCSC
database. The reason is that UCSC maintains the gene annotation file in GTF format. Use the
following commands to create a new directory “ucscref” and then download and
decompress the human reference genome and GTF annotation file:
mkdir ucscref
# Download the hg38 reference genome (FASTA)
wget \
-O "ucscref/hg38.fa.gz" \
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz
# Download the NCBI RefSeq gene annotation (GTF)
wget \
-O "ucscref/hg38.ncbiRefSeq.gtf.gz" \
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
# Decompress both files in place
gzip -d ucscref/hg38.fa.gz
gzip -d ucscref/hg38.ncbiRefSeq.gtf.gz
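Before indexing, a quick sanity check of the two decompressed files can catch an incomplete download or sequence names that differ between the FASTA and GTF files. The sketch below is one possible check, not part of STAR; the function name summarize_reference is hypothetical.

```shell
# summarize_reference FASTA GTF
# Counts FASTA sequences and GTF exon records, then lists sequence names
# that appear in the GTF but have no matching FASTA header.
summarize_reference() {
    fasta=$1
    gtf=$2
    [ -s "$fasta" ] || { echo "missing or empty: $fasta" >&2; return 1; }
    [ -s "$gtf" ]   || { echo "missing or empty: $gtf" >&2; return 1; }

    # Each FASTA record starts with a '>' header line.
    n_seq=$(awk '/^>/ { n++ } END { print n + 0 }' "$fasta")
    # GTF column 3 holds the feature type; '#' lines are comments.
    n_exon=$(awk -F '\t' '!/^#/ && $3 == "exon" { n++ } END { print n + 0 }' "$gtf")
    echo "sequences: $n_seq"
    echo "exon records: $n_exon"

    # First pass (FASTA): remember header names. Second pass (GTF): report
    # each column-1 name that was not seen, once.
    awk '
        NR == FNR && /^>/ { seen[substr($1, 2)] = 1 }
        NR == FNR { next }
        /^#/ { next }
        !($1 in seen) && !($1 in reported) { reported[$1] = 1; print "not in FASTA: " $1 }
    ' "$fasta" "$gtf"
}
```

For the files downloaded above, the call would be: summarize_reference ucscref/hg38.fa ucscref/hg38.ncbiRefSeq.gtf. Scanning the full human genome this way takes a little while, since every line of the FASTA file is read.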
Then, we will build the index for the reference genome using the “STAR” command.